Abstract

The No Free Lunch theorems are often used to argue that domain specific
knowledge is required to design successful algorithms. We use algorithmic
information theory to argue the case for a universal bias allowing an algorithm
to succeed in all interesting problem domains. Additionally, we give a new
algorithm for off-line classification, inspired by Solomonoff induction, with good
performance on all structured (compressible) problems under reasonable
assumptions. This includes a proof of the efficacy of the well-known heuristic of
randomly selecting training data in the hope of reducing the misclassification
rate.
1 Introduction



The No Free Lunch (NFL) theorems, stated and proven in various settings and
domains [Sch94, Wol01, WM97], show that no algorithm performs better than any
other when their performance is averaged uniformly over all possible problems of a
particular type.¹ These are often cited to argue that algorithms must be designed
for a particular domain or style of problem, and that there is no such thing as a
general purpose algorithm.

On the other hand, Solomonoff induction [Sol64a, Sol64b] and the more general
AIXI model [Hut04] appear to universally solve the sequence prediction and reinforcement
learning problems respectively. The key to the apparent contradiction is
that Solomonoff induction and AIXI do not assume that each problem is equally 
likely. Instead they apply a bias towards more structured problems. This bias is 
universal in the sense that no class of structured problems is favored over another. 
This approach is philosophically well justified by Occam's razor. 

The two classic domains for NFL theorems are optimisation and classification.
In this paper we will examine classification and only remark that the case for
optimisation is more complex. This difference is due to the active nature of optimisation
where actions affect future observations.

Previously, some authors have argued that the NFL theorems do not disprove 
the existence of universal algorithms for two reasons. 

1. Taking a uniform average is not philosophically the right thing to do, as
argued informally in [GCP05].

2. Carroll and Seppi [CS07] note that the NFL theorem measures performance
as misclassification rate, whereas in practice a misclassification in one
direction may be more costly than in the other.

We restrict our consideration to the task of minimising the misclassification rate
while arguing more formally for a non-uniform prior inspired by Occam's razor and
formalised by Kolmogorov complexity. We also show that there exist algorithms
(unfortunately only computable in the limit) with very good properties on all
structured classification problems.

The paper is structured as follows. First, the required notation is introduced 
(Section 2). We then state the original NFL theorem, give a brief introduction to 
Kolmogorov complexity, and show that if a non-uniform prior inspired by Occam's 
razor is used, then there exists a free lunch (Section 3). Finally, we give a new 
algorithm inspired by Solomonoff induction with very attractive properties in the 
classification problem (Section 4). 

1 Such results have been less formally discussed long before by Watanabe in 1969 [WD69].
2 Preliminaries



Here we introduce the required notation and define the problem setup for the No 
Free Lunch theorems. 

Strings. A finite string $x$ over alphabet $X$ is a finite sequence $x_1 x_2 x_3 \cdots x_{n-1} x_n$
with $x_i \in X$. An infinite string $x$ over alphabet $X$ is an infinite sequence $x_1 x_2 x_3 \cdots$.
Alphabets are usually countable or finite, while in this paper they will almost always
be binary. For finite strings we have a length function defined by $\ell(x) := n$ for
$x = x_1 x_2 \cdots x_n$. The empty string of length 0 is denoted by $\epsilon$. The set $X^n$ is the set
of all strings of length $n$. The set $X^*$ is the set of all finite strings. The set $X^\infty$ is
the set of all infinite strings. Let $x$ be a string (finite or infinite) then substrings are
denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s \le t$. A useful shorthand is $x_{<t} := x_{1:t-1}$.
Let $x, y \in X^*$ and $z \in X^\infty$ with $\ell(x) = n$ and $\ell(y) = m$ then

$$xy := x_1 x_2 \cdots x_{n-1} x_n y_1 y_2 \cdots y_{m-1} y_m$$
$$xz := x_1 x_2 \cdots x_{n-1} x_n z_1 z_2 z_3 \cdots$$

As expected, $xy$ is finite and has length $\ell(xy) = n + m$ while $xz$ is infinite. For
binary strings, we write $\#1(x)$ and $\#0(x)$ to mean the number of 1's and the number
of 0's in $x$ respectively.

Classification. Informally, a classification problem is the task of matching features 
to class labels. For example, recognizing handwriting where the features are images 
and the class labels are letters. In supervised learning, it is (usually) unreasonable to 
expect this to be possible without any examples of correct classifications. This can 
be solved by providing a list of feature/class label pairs representing the true
classification of each feature. It is hoped that these examples can be used to generalize
and correctly classify other features.

The following definitions formalize classification problems, algorithms capable of 
solving them, as well as the loss incurred by an algorithm when applied to a problem, 
or set of problems. The setting is that of transductive learning as in [DEyM04].

Definition 1 (Classification Problem). Let $X$ and $Y$ be finite sets representing the
feature space and class labels respectively. A classification problem over $X$, $Y$ is
defined by a function $f : X \to Y$ where $f(x)$ is the true class label of feature $x$.

In the handwriting example, X might be the set of all images of a particular size 
and Y would be the set of letters/numbers as well as a special symbol for images 
that correspond to no letter/number. 

Definition 2 (Classification Algorithm). Let $f$ be a classification problem and
$X_m \subseteq X$ be the training features on which $f$ will be known. We write $f_{X_m}$ to
represent the function $f_{X_m} : X_m \to Y$ with $f_{X_m}(x) := f(x)$ for all $x \in X_m$. A
classification algorithm is a function, $A$, where $A(f_{X_m}, x)$ is its guess for the class
label of feature $x \in X_u := X - X_m$ when given training data $f_{X_m}$. Note we implicitly
assume that $X$ and $Y$ are known to the algorithm.
Definition 3 (Loss function). The loss of algorithm $A$, when applied to classification
problem $f$, with training data $X_m$ is measured by counting the proportion of
misclassifications in the testing data, $X_u$.

$$L_A(f, X_m) := \frac{1}{|X_u|} \sum_{x \in X_u} [\![ A(f_{X_m}, x) \neq f(x) ]\!]$$

where $[\![ \cdot ]\!]$ is the indicator function defined by $[\![ expr ]\!] = 1$ if $expr$ is true and
$0$ otherwise.

We are interested in the expected loss of an algorithm on the set of all problems 
where expectation is taken with respect to some distribution P. 

Definition 4 (Expected loss). Let $\mathcal{M}$ be the set of all functions from $X$ to $Y$ and
$P$ be a probability distribution on $\mathcal{M}$. If $X_m$ is the training data then the expected
loss of algorithm $A$ is

$$L_A(P, X_m) := \sum_{f \in \mathcal{M}} P(f) L_A(f, X_m)$$

3 No Free Lunch Theorem 

We now use the above notation to give a version of the No Free Lunch Theorem of 
which Wolpert's is a generalization. 

Theorem 5 (No Free Lunch). Let $P$ be the uniform distribution on $\mathcal{M}$. Then the
following holds for any algorithm $A$ and training data $X_m \subseteq X$.

$$L_A(P, X_m) = \frac{|Y| - 1}{|Y|} \qquad (1)$$

The key to the proof is the following observation. Let $x \in X_u$, then for all
$y \in Y$, $P(f(x) = y \mid f_{X_m}) = P(f(x) = y) = 1/|Y|$. This means no information can
be inferred from the training data, which suggests no algorithm can be better than
random.
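The theorem can be checked by brute force on a tiny instance. The following is only an illustrative sketch: the feature set, label set, training split and the three algorithms are hypothetical choices made for the example, not taken from the paper. It enumerates every function from X to Y, averages the test loss uniformly, and compares the result with (|Y| - 1)/|Y|.

```python
from itertools import product

def expected_loss_uniform(algorithm, features, labels, train):
    """Average test loss of `algorithm` over every function f: features -> labels."""
    test = [x for x in features if x not in train]
    total, count = 0.0, 0
    for values in product(labels, repeat=len(features)):
        f = dict(zip(features, values))       # one classification problem
        f_train = {x: f[x] for x in train}    # the only labels the algorithm may see
        errors = sum(algorithm(f_train, x) != f[x] for x in test)
        total += errors / len(test)
        count += 1
    return total / count                      # uniform prior: P(f) = 1 / |M|

# Three hypothetical algorithms, chosen only for illustration.
def always_zero(f_train, x):
    return 0

def majority_vote(f_train, x):
    votes = list(f_train.values())
    return max(set(votes), key=votes.count)

def copy_nearest(f_train, x):
    nearest = min(f_train, key=lambda t: abs(t - x))
    return f_train[nearest]

features = list(range(6))   # |X| = 6
labels = [0, 1, 2]          # |Y| = 3
train = [0, 1, 2, 3]        # X_m; the test set X_u is {4, 5}

nfl_value = (len(labels) - 1) / len(labels)
losses = {
    a.__name__: expected_loss_uniform(a, features, labels, train)
    for a in (always_zero, majority_vote, copy_nearest)
}
```

Under these assumptions every entry of `losses` agrees with `nfl_value` up to floating point error, however the algorithm uses the training labels.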

Occam's razor/Kolmogorov complexity. The theorem above is often used to 
argue that no general purpose algorithm exists and that focus should be placed on 
learning in specific domains. 

The problem with the result is the underlying assumption that P is uniform, 
which implies that training data provides no evidence about the true class labels 
of the test data. For example, if we have classified the sky as blue for the last 
1,000 years then a uniform assumption on the possible sky colours over time would 
indicate that it is just as likely to be green tomorrow as blue, a result that goes 
against all our intuition. 

How then, do we choose a more reasonable prior? Fortunately, this question has 
already been answered heuristically by experimental scientists who must endlessly 
choose between one of a number of competing hypotheses. Given any experiment, it
is easy to construct a hypothesis that fits the data by using a lookup table. However 
such hypotheses tend to have poor predictive power compared to a simple alternative 
that also matches the data. This is known as the principle of parsimony, or Occam's 
razor, and suggests that simple hypotheses should be given a greater weight than 
more complex ones. 

Until recently, Occam's razor was only an informal heuristic. This changed 
when Solomonoff, Kolmogorov and Chaitin independently developed the field of 
algorithmic information theory that allows for a formal definition of Occam's razor. 
We give a brief overview here, while a more detailed introduction can be found
in [LV08]. An in depth study of the philosophy behind Occam's razor and its
formalisation by Kolmogorov complexity can be found in [KLV97, RH11]. While we
believe Kolmogorov complexity is the most foundational formalisation of Occam's
razor, there have been other approaches such as MML [WB68] and MDL [Gru07].
These other techniques have the advantage of being computable (given a computable 
prior) and so lend themselves to good practical applications. 

The idea of Kolmogorov complexity is to assign to each binary string an integer 
valued complexity that represents the length of its shortest description. Those strings 
with short descriptions are considered simple, while strings with long descriptions are 
complex. For example, the string consisting of 1,000,000 1's can easily be described
as "one million ones". On the other hand, to describe a string generated by tossing
a coin 1,000,000 times would likely require a description about 1,000,000 bits long. 
The key to formalising this intuition is to choose a universal Turing machine as the 
language of descriptions. 
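The intuition can be illustrated with an off-the-shelf compressor. This is only a rough sketch: zlib is a computable stand-in chosen for the example, not the universal Turing machine used in the formal definition, so it gives at best a crude proxy for description length. The seeded pseudo-random string stands in for real coin tosses.

```python
import random
import zlib

def compressed_bits(data: bytes) -> int:
    """Length in bits of a zlib encoding of `data` (a crude proxy for description length)."""
    return 8 * len(zlib.compress(data, 9))

N_BITS = 1_000_000

# The string of 1,000,000 ones, packed eight bits to a byte.
ones = b"\xff" * (N_BITS // 8)

# Stand-in for 1,000,000 coin tosses (seeded so the run is repeatable).
coin_flips = random.Random(0).getrandbits(N_BITS).to_bytes(N_BITS // 8, "big")

ones_bits = compressed_bits(ones)
coin_bits = compressed_bits(coin_flips)
```

In this sketch `ones_bits` is a tiny fraction of `N_BITS`, while `coin_bits` stays close to `N_BITS`, mirroring the two cases described above.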

Definition 6 (Kolmogorov Complexity). Let $U$ be a universal Turing machine and
$x \in \mathbb{B}^*$ be a finite binary string. Then define the plain Kolmogorov complexity $C(x)$
to be the length of the shortest program (description) $p$ such that $U(p) = x$.

$$C(x) := \min_{p \in \mathbb{B}^*} \{ \ell(p) : U(p) = x \}$$

It is easy to show that C depends on choice of universal Turing machine U only 
up to a constant independent of x and so it is standard to choose an arbitrary 
reference universal Turing machine. 

For technical reasons it is difficult to use C as a prior, so Solomonoff introduced 
monotone machines to construct the Solomonoff prior, M. A monotone Turing 
machine has one read-only input tape which may only be read from left to right and 
one write-only output tape that may only be written to from left to right. It has 
any number of working tapes. Let T be such a machine and write T(p) = x to mean 
that after reading p, x is on the output tape. The machines are called monotone 
because if p is a prefix of q then T(p) is a prefix of T(q). It is possible to show there 
exists a universal monotone Turing machine U and this is used to define monotone 
complexity Km and Solomonoff's prior, M.
Definition 7 (Monotone Complexity). Let $U$ be the reference universal monotone
Turing machine then define $Km$, $M$ and $KM$ as follows,

$$Km(x) := \min \{ \ell(p) : U(p) = x* \}$$
$$M(x) := \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}$$
$$KM(x) := -\log M(x)$$

where $U(p) = x*$ means that when given input $p$, $U$ outputs $x$ possibly followed by
more bits.

Some facts/notes follow. 

1. For any $n$, $\sum_{x \in \mathbb{B}^n} M(x) \le 1$.

2. $Km$, $M$ and $KM$ are incomputable.

3. $0 \le KM(x) \approx Km(x) \approx C(x) \le \ell(x) + O(1)$.²

To illustrate why $M$ gives greater weight to simple $x$, suppose $x$ is simple then
there exists a relatively short program $p$ on the monotone Turing machine computing it.
Therefore $Km(x)$ is small and so $2^{-Km(x)} \approx M(x)$ is relatively large.

Since $M$ is a semi-measure rather than a proper measure, it is not appropriate
to use it in place of $P$ when computing expected loss. However it can be normalized
to a proper measure, $M_{norm}$, defined inductively by

$$M_{norm}(\epsilon) := 1, \qquad M_{norm}(xb) := M_{norm}(x) \frac{M(xb)}{M(x0) + M(x1)}$$

Note that this normalisation is not unique, but is philosophically and technically the
most attractive and was used and defended by Solomonoff. For a discussion of
normalisation, see [LV08, p. 303]. The normalised version satisfies $\sum_{x \in \mathbb{B}^n} M_{norm}(x) = 1$.
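The effect of the normalisation can be seen on a toy example. The sketch below assumes the inductive rule M_norm(ε) = 1, M_norm(xb) = M_norm(x) M(xb) / (M(x0) + M(x1)). Since M itself is incomputable, `toy_semimeasure` is a hypothetical computable semi-measure invented for the illustration: it loses a tenth of its mass at every step.

```python
from itertools import product

def toy_semimeasure(x: str) -> float:
    """Hypothetical semi-measure: M(x0) + M(x1) = 0.9 * M(x) <= M(x)."""
    return 0.4 ** x.count("0") * 0.5 ** x.count("1")

def normalise(M):
    """Return M_norm built from M by the inductive rule stated in the lead-in."""
    def M_norm(x: str) -> float:
        if x == "":
            return 1.0
        prefix, b = x[:-1], x[-1]
        return M_norm(prefix) * M(prefix + b) / (M(prefix + "0") + M(prefix + "1"))
    return M_norm

def total_mass(measure, n: int) -> float:
    """Sum of measure(x) over all binary strings x of length n."""
    return sum(measure("".join(bits)) for bits in product("01", repeat=n))

M_norm = normalise(toy_semimeasure)
mass_before = total_mass(toy_semimeasure, 5)   # strictly below 1
mass_after = total_mass(M_norm, 5)             # equals 1 up to rounding
```

The relative weights of the two extensions x0 and x1 are preserved; only the lost mass is redistributed.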

We will also need to define $M$/$KM$ with side information, $M(y; x) := M(y)$
where $x$ is provided on a spare tape of the universal Turing machine. Now define
$KM(y; x) := -\log M(y; x)$. This allows us to define the complexity of a function in
terms of its output relative to its input. 

Definition 8 (Complexity of a function). Let $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{B}^k$ and $f : X \to \mathbb{B}$
then define the complexity of $f$, $KM(f; X)$ by

$$KM(f; X) := KM(f(x_1) f(x_2) \cdots f(x_n); x_1, x_2, \ldots, x_n)$$

An example is useful to illustrate why this is a good measure of the complexity
of $f$.



2 The approximation $C(x) \approx Km(x)$ is only accurate to $\log \ell(x)$, while $KM \approx Km$ is almost
always very close [Gac83, Gac08]. This is a little surprising since the sum in the definition of $M$
contains $2^{-Km}$. It shows that there are only comparatively few short programs for any $x$.
Example 9. Let $X \subseteq \mathbb{B}^n$ for some $n$, and $Y = \mathbb{B}$ and $f : X \to Y$ be defined by
$f(x) = [\![ x_n = 1 ]\!]$. Now for a complex $X$, the string $f(x_1) f(x_2) \cdots$ might be difficult
to describe, but there is a very short program that can output $f(x_1) f(x_2) \cdots$ when
given $x_1 x_2 \cdots$ as input. This gives the expected result that $KM(f; X)$ is very small.

Free lunch using Solomonoff prior. We are now ready to use $M_{norm}$ as a prior on
a problem family. The following proposition shows that when problems are chosen
according to the Solomonoff prior there is a (possibly small) free lunch.

Before the proposition, we remark on problems with maximal complexity,
$KM(f; X) = O(|X|)$. In this case $f$ exhibits no structure allowing it to be
compressed, which turns out to be equivalent to being random in every intuitive sense
[ML66]. We do not believe such problems are any more interesting than trying to
predict random coin flips. Further, the NFL theorems can be used to show that no
algorithm can learn the class of random problems by noting that almost all problems
are random. Thus a bias towards random problems is not much of a bias (from
uniform) at all, and so at most leads to a decreasingly small free lunch as the number
of problems increases.

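Proposition 1 (Free lunch under Solomonoff prior). Let $Y = \mathbb{B}$ and fix a $k \in \mathbb{N}$.
Now let $X = \mathbb{B}^n$ and $X_m \subset X$ such that $|X_m| = 2^n - k$. For sufficiently large $n$
there exists an algorithm $A$ such that

$$L_A(M_{norm}, X_m) < 1/2$$

Before the proof of Proposition 1 we require an easy lemma.

Lemma 10. Let $\mathcal{N} \subseteq \mathcal{M}$ then there exists an algorithm $A_{\mathcal{N}}$ such that

$$\sum_{f \in \mathcal{N}} P(f) L_{A_{\mathcal{N}}}(f, X_m) \le \frac{1}{2} \sum_{f \in \mathcal{N}} P(f)$$

Proof. Let $A_i$ with $i \in \{0, 1\}$ be the algorithm always choosing $i$. Note that

$$\sum_{f \in \mathcal{N}} P(f) \left[ L_{A_0}(f, X_m) + L_{A_1}(f, X_m) \right] = \sum_{f \in \mathcal{N}} P(f)$$

The result follows easily. □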

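Proof of Proposition 1. Now let $\mathcal{M}_1$ be the set of all $f \in \mathcal{M}$ with $f(y) = 1 \ \forall y \in X_m$
and $\mathcal{M}_0 = \mathcal{M} - \mathcal{M}_1$. Now construct an $A$ by

$$A(f_{X_m}, x) := \begin{cases} 1 & \text{if } f_{X_m}(y) = 1 \ \forall y \in X_m \\ A_{\mathcal{M}_0}(f_{X_m}, x) & \text{otherwise} \end{cases}$$

where $A_{\mathcal{M}_0}$ is the algorithm of Lemma 10 for $\mathcal{N} = \mathcal{M}_0$.
Let $f_1 \in \mathcal{M}_1$ be the constant valued function such that $f_1(x) = 1 \ \forall x$ then

$$\begin{aligned}
L_A(M_{norm}, X_m) &= \sum_{f \in \mathcal{M}} M_{norm}(f) L_A(f, X_m) && (2) \\
&= \sum_{f \in \mathcal{M}_0} M_{norm}(f) L_A(f, X_m) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) L_A(f, X_m) && (3) \\
&\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) L_A(f, X_m) && (4) \\
&\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) && (5) \\
&\le \frac{1}{2} (1 - \delta) + \sum_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) < \frac{1}{2} && (6)
\end{aligned}$$

where (2) is definitional, (3) follows by splitting the sum into $\mathcal{M}_0$ and $\mathcal{M}_1$, (4)
by the previous lemma, (5) since loss is bounded by 1 and the loss incurred on
$f_1$ is 0. The first inequality of (6) follows since it can be shown that there exists
a $\delta > 0$ such that $M_{norm}(f_1) > \delta$ with $\delta$ independent of $n$. The second because
$\sum_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) \to 0$ as $n \to \infty$ and $|\mathcal{M}_1|$ is independent of $n$. □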
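The contrast between the uniform prior and a simplicity-biased prior can be illustrated numerically. The sketch below does not use M_norm, which is incomputable. Instead it assumes a hypothetical toy prior that weights a labelling by 2 to the power of minus its number of label changes along a fixed ordering of the features, and a hypothetical algorithm that repeats the last training label. The split of 8 training and 2 test points is likewise an arbitrary choice.

```python
from itertools import product

def expected_loss(algorithm, prior, n, n_train):
    """Expected test loss over all f in {0,1}^n, weighted by the normalised `prior`."""
    test = range(n_train, n)
    weights = {f: prior(f) for f in product((0, 1), repeat=n)}
    z = sum(weights.values())
    loss = 0.0
    for f, w in weights.items():
        errors = sum(algorithm(f[:n_train], i) != f[i] for i in test)
        loss += (w / z) * errors / len(test)
    return loss

def uniform_prior(f):
    return 1.0

def simplicity_prior(f):
    # Hypothetical stand-in for a complexity-based prior: fewer label changes, more weight.
    changes = sum(a != b for a, b in zip(f, f[1:]))
    return 2.0 ** -changes

def repeat_last_label(train_labels, i):
    # Hypothetical algorithm: predict the most recently seen training label.
    return train_labels[-1]

n, n_train = 10, 8
uniform_loss = expected_loss(repeat_last_label, uniform_prior, n, n_train)
biased_loss = expected_loss(repeat_last_label, simplicity_prior, n, n_train)
```

Under the uniform prior the expected loss is exactly 1/2, as Theorem 5 requires. Under the toy simplicity prior the same algorithm achieves 7/18, so a free lunch appears as soon as structured problems receive more weight.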

The proposition is unfortunately extremely weak. It is more interesting to know 
exactly what conditions are required to do much better than random. In the next 
section we present an algorithm with good performance on all well structured
problems when given "good" training data. Without good training data, even assuming
a Solomonoff prior, we believe it is unlikely that the best algorithm will perform 
well. 

Note that while it appears intuitively likely that any non-uniform distribution
such as $M_{norm}$ might offer a free lunch, this is in fact not true. It is shown in
[SVW01] that there exist non-uniform distributions where the loss over a problem
family is independent of algorithm. These distributions satisfy certain symmetry
conditions not satisfied by $M_{norm}$, which allows Proposition 1 to hold.



4 Complexity-based classification 

Solomonoff induction is well known to solve the online prediction problem where the 
true value of each classification is known after each guess. In our setup, the true 
classification is only known for the training data, after which the algorithm no longer 
receives feedback. While Solomonoff induction can be used to bound the number of 
total errors while predicting deterministic sequences, it gives no indication of when 
these errors may occur. For this reason we present a complexity-inspired algorithm 
with better properties for the offline classification problem. 
Before the algorithm we present a little more notation. As usual, let $X =
\{x_1, x_2, \ldots, x_n\} \subseteq \mathbb{B}^k$, $Y = \mathbb{B}$ and let $X_m \subseteq X$ be the training data. Now define an
indicator function $\chi$ by $\chi_i := [\![ x_i \in X_m ]\!]$.

Definition 11. Let $f : X \to Y$ be a classification problem. The algorithm $A^*$ is defined
in two steps.

1. $\tilde{f} := \arg\min_{g} \{ KM(g; X) : \chi_i = 1 \implies g(x_i) = f(x_i) \}$

2. $A^*(f_{X_m}, x_i) := \tilde{f}(x_i)$

Essentially $A^*$ chooses for its model the simplest $\tilde{f}$ consistent with the training
data and uses this for classifying unseen data. Note that the definition above only
uses the value $y_i = f(x_i)$ where $\chi_i = 1$, and so it does not depend on unseen labels.

If $KM(f; X)$ is "small" then the function we wish to learn is simple so we should
expect to be able to perform good classification, even given a relatively small amount
of training data. This turns out to be true, but only with a good choice of training
data. It is well known that training data should be "broad enough", and this
is backed up by the example below and by Theorem 14, which give an excellent
justification for random training data based on good theoretical (Theorem 14) and
philosophical (AIT) underpinnings. The following example demonstrates the effect
of bad training data on the performance of $A^*$.




Figure 1: A simple problem 

Example 12. Let $X = \{0000, 0001, 0010, 0011, \ldots, 1101, 1110, 1111\}$ and $f(x)$ be
defined to be the first bit of $x$ as in Figure 1. Now suppose $\chi = 1^8 0^8$ (so the
algorithm is only allowed to see the true class labels of $x_1$ through $x_8$). In this case,
the simplest $\tilde{f}$ consistent with the first 8 data points, all of which are zeros, is likely
to be $\tilde{f}(x) = 0$ for all $x \in X$ and so $A^*$ will fail on every piece of testing data!

On the other hand, if $\chi$ = 001010011101101, which was generated by tossing a
coin 16 times, then $\tilde{f}$ will very likely be equal to $f$ and so $A^*$ will make no errors.
Even if $\chi$ is zero about the critical point in the middle ($\chi_8 = \chi_9 = 0$) then $\tilde{f}$ should
still match $f$ mostly around the left and right and will only be unsure near the
middle.
Note, the above is not precisely true since for small strings the dependence of
KM on the universal monotone Turing machine can be fairly large. However if we
increase the size of the example so that $|X| > 1000$ then these quirks disappear for
natural reference universal Turing machines.

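Definition 13 (Entropy). Let $\theta \in [0, 1]$.

$$H(\theta) := \begin{cases} -[\theta \log \theta + (1 - \theta) \log(1 - \theta)] & \text{if } \theta \neq 0 \text{ and } \theta \neq 1 \\ 0 & \text{otherwise} \end{cases}$$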



Theorem 14. Let $\theta \in (0, 1)$ be the proportion of data to be given for training then:

1. There exists a $\chi$ (training set) such that for all $n \in \mathbb{N}$, $\theta n - c_1 \le
\#1(\chi_{1:n}) \le \theta n + c_1$ and $nH(\theta) - c_2 \le KM(\chi_{1:n})$ for some $c_1, c_2 \in \mathbb{R}^+$.

2. For $n = |X|$, the loss of algorithm $A^*$ when using training data determined by
$\chi$ is bounded by

$$L_{A^*}(f, X_m) \le \frac{2KM(f; X) + KM(X) + c_2 + c_3}{n(1 - \theta - c_1/n) \log(1 - \theta + c_1/n)^{-1}}$$

where $c_3$ is some constant independent of all inputs.

This theorem shows that $A^*$ will do well on all problems satisfying $KM(f; X) =
o(n)$ when given good (but not necessarily a lot of) training data. Before the proof,
some remarks.

1. The bound is a little messy, but for small $\theta$, large $n$ and simple $X$ we get
$L_{A^*}(f, X_m) \lesssim 2KM(f; X)/(n\theta)$.

2. The loss bound is extremely bad for large $\theta$. We consider this unimportant
since we only really care if $\theta$ is small. Also, note that if $\theta$ is large then the
number of points we have to classify is small and so we still make only a few
mistakes.

3. The constants $c_1$, $c_2$ and $c_3$ are relatively small (around 100-500). They represent
the length of the shortest programs computing simple transformations or
encodings. This is dependent on the universal Turing machine used to define
the Solomonoff distribution, but for a natural universal Turing machine we
expect it to be fairly small [Hut04, sec. 2.2.2].

4. The "special" $\chi$ is not actually that special at all. In fact, it can be generated
easily with probability 1 by tossing a coin with bias $\theta$ infinitely often. More
formally, it is a $\mu$ Martin-Löf random string where $\mu(1|x) = \theta$ for all $x$. Such
strings form a $\mu$-measure 1 set in $\mathbb{B}^\infty$.
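To get a feel for the size of the bound in Theorem 14, the right-hand side can be evaluated for made-up inputs. Everything numeric below is hypothetical: the complexities KM(f;X) and KM(X) are incomputable, so the values passed in are placeholders, the constants are picked from the 100-500 range mentioned in remark 3, and the logarithm is assumed to be base 2.

```python
import math

def theorem14_bound(km_f, km_x, n, theta, c1, c2, c3):
    """Right-hand side of the loss bound, with log taken to base 2 (an assumption)."""
    if not 0 < theta < 1 or c1 / n >= theta:
        raise ValueError("need 0 < theta < 1 and c1/n < theta for a positive denominator")
    numerator = 2 * km_f + km_x + c2 + c3
    denominator = n * (1 - theta - c1 / n) * math.log2(1 / (1 - theta + c1 / n))
    return numerator / denominator

# Hypothetical inputs: a problem with KM(f;X) = 1000 bits, negligible KM(X),
# 10% of the data used for training and constants of size 100.
bound_1e6 = theorem14_bound(km_f=1000, km_x=0, n=10**6, theta=0.1, c1=100, c2=100, c3=100)
bound_1e7 = theorem14_bound(km_f=1000, km_x=0, n=10**7, theta=0.1, c1=100, c2=100, c3=100)
```

For a fixed complexity the bound shrinks roughly like 1/n, in line with remark 1.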
Proof of Theorem 14. The first is a basic result in algorithmic information theory
[LV08, p. 318]. Essentially choosing $\chi$ to be Martin-Löf random with respect to a
Bernoulli process parameterized by $\theta$. From now on, let $\theta = \#1(\chi_{1:n})/n$. For simplicity
we write $x := x_1 x_2 \cdots x_n$, $y := f(x_1) f(x_2) \cdots f(x_n)$, and $\tilde{y} := \tilde{f}(x_1) \tilde{f}(x_2) \cdots \tilde{f}(x_n)$,
where $\tilde{f}$ is the model selected by $A^*$.
Define indicator $\psi$ by $\psi_i := [\![ \chi_i = 0 \wedge y_i = \tilde{y}_i ]\!]$. Now note that there exists $c_3 \in \mathbb{R}$
such that

$$KM(\chi_{1:n}) \le KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c_3 \qquad (7)$$

This follows since we can easily use $y$, $\tilde{y}$ and $\psi_{1:n}$ to recover $\chi_{1:n}$ by $\chi_i = 1$ if and
only if $y_i = \tilde{y}_i$ and $\psi_i \neq 1$. The constant $c_3$ is the length of the reconstruction
program. Now $KM(\tilde{y}; x) \le KM(y; x)$ follows directly from the definition of $\tilde{f}$. We
now compute an upper bound on $KM(\psi)$. Let $a := L_{A^*}(f, X_m)$ be the proportion
of the testing data on which $A^*$ makes an error. The following is easy to verify:

1. $\#1(\psi) = (1 - a)(1 - \theta)n$

2. $\#0(\psi) = (1 - (1 - a)(1 - \theta))n$

3. $y_i \neq \tilde{y}_i \implies \psi_i = 0$

4. $\#1(y \oplus \tilde{y}) = a(1 - \theta)n$ where $\oplus$ is the exclusive or function.

We can use point 3 above to trivially encode $\psi_i$ when $y_i \neq \tilde{y}_i$. Aside from these,
there are exactly $\theta n$ 0's and $(1 - a)(1 - \theta)n$ 1's. Coding this subsequence using
frequency estimation gives a code for $\psi_{1:n}$ given $y$ and $\tilde{y}$, which we substitute into
(7).

$$\begin{aligned}
nH(\theta) - c_2 \le KM(\chi_{1:n}) &\le KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c_3 && (8) \\
&\le 2KM(y; x) + KM(x) + n J(\theta, a) + c_3
\end{aligned}$$

where $J(\theta, a) := [\theta + (1 - \theta)(1 - a)] \, H\!\left( \theta / [\theta + (1 - \theta)(1 - a)] \right)$. An easy technical
result (Lemma 16 in the appendix) shows that for $\theta \in (0, 1)$

$$0 \le a(1 - \theta) \log \frac{1}{1 - \theta} \le H(\theta) - J(\theta, a)$$

Therefore $n a (1 - \theta) \log \frac{1}{1 - \theta} \le 2KM(y; x) + KM(x) + c_2 + c_3$. The result follows by
rearranging and using part 1 of the theorem. □

Since the features are known, it is unexpected for the bound to depend on their 
complexity, KM(X). Therefore it is not surprising that this dependence can be 
removed at a small cost, and with a little extra effort. 
Theorem 15. Under the same conditions as Theorem 14 the loss of $A^*$ is bounded
by

$$L_{A^*}(f, X_m) \le \frac{2KM(f; X) + 2[\log|X| + \log\log|X|] + c}{n(1 - \theta - c_1/n) \log(1 - \theta + c_1/n)^{-1}}$$

where $c$ is some constant independent of inputs.

This version will be preferred to Theorem 14 in cases where $KM(X) >
2[\log|X| + \log\log|X|]$. The proof of Theorem 15 is almost identical to that of
Theorem 14.

Proof sketch: The idea is to replace equation (7) by

$$KM(\chi_{1:n}, x) \le KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c_3 \qquad (9)$$

Then use the following identities $K(\chi_{1:n}; x, K(x)) + K(x) \le K(\chi_{1:n}, x) - K(\ell(x)) \le
KM(\chi_{1:n}, x)$ where the inequalities are true up to constants independent of $x$ and $\chi$.
Next a counting argument in combination with Stirling's approximation can be used
to show that most $\chi$ satisfying the conditions in Theorem 14 have $KM(\chi_{1:n}) \le
K(\chi_{1:n}) \le K(\chi_{1:n}; x, K(x)) + \log \ell(x) + r$ for some constant $r > 0$ independent of $x$
and $\chi$. Finally use $KM(x) \le K(x)$ for all $x$ and $K(\ell(x)) \le \log \ell(x) + 2 \log\log \ell(x) + r$
for some constant $r > 0$ independent of $x$ to rearrange (9) into

$$KM(\chi_{1:n}) \le KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + 2 \log \ell(x) + 2 \log\log \ell(x) + c$$

for some constant $c > 0$ independent of $\chi$, $x$ and $y$. Finally use the techniques in
the proof of Theorem 14 to complete the proof. □



5 Discussion 

Summary. Proposition 1 shows that if problems are distributed according to their
complexity, as Occam's razor suggests they should, then a (possibly small) free lunch 
exists. While the assumption of simplicity still represents a bias towards certain 
problems, it is a universal one in the sense that no style of structured problem is 
more favoured than another. 

In Section 4 we gave a complexity-based classification algorithm and proved the 
following properties: 

1. It performs well on problems that exhibit some compressible structure, 
KM(f;X) = o(n). 

2. Increasing the amount of training data decreases the error. 

3. It performs better when given a good (broad/randomized) selection of training 
data. 
Theorem 14 is reminiscent of the transductive learning bounds of Vapnik and others
[DEyM04, Vap82, Vap00], but holds for all Martin-Löf random training data, rather
than with high probability. This is different to the predictive result in Solomonoff
induction where results hold with probability 1 rather than for all Martin-Löf random
sequences [HM07]. If we assume the training set is sampled randomly, then
our bounds are comparable to those in [DEyM04].

Unfortunately, the algorithm of Section 4 is incomputable. However Kolmogorov
complexity can be approximated via standard compression algorithms, which may
allow for a computable approximation of the classifier of Section 4. Such approximations
have had some success in other areas of AI, including general reinforcement
learning [VNH+11] and unsupervised clustering [CV05].
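As one concrete instance of replacing Kolmogorov complexity by a standard compressor, the following sketch computes the normalised compression distance known from the compression-based clustering literature. It is not the classifier of Section 4 and is not claimed to be the method of any cited paper. zlib and the sample strings are arbitrary choices made for the illustration.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes, used as a computable stand-in for complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)
structured_a = b"the sky is blue today. " * 40
structured_b = b"the sky is blue tonight. " * 40
noise = bytes(rng.getrandbits(8) for _ in range(len(structured_a)))

d_similar = ncd(structured_a, structured_b)   # two strings sharing most of their structure
d_noise = ncd(structured_a, noise)            # a structured string against random bytes
```

In this sketch `d_similar` comes out well below `d_noise`, which is the property clustering methods of this kind rely on.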

Occam's razor is often thought of as the principle of choosing the simplest
hypothesis matching your data. Our definition of simplest is the hypothesis that
minimises KM(f;X) (maximises M(f;X)). This is perhaps not entirely natural from
the informal statement of Occam's razor, since M(x) contains contributions from all
programs computing x, not just the shortest. We justify this by combining Occam's
razor with Epicurus' principle of multiple explanations that argues for all consistent
hypotheses to be considered. In some ways this is the most natural interpretation
as no scientist would entirely rule out a hypothesis just because it is slightly more
complex than the simplest. A more general discussion of this issue can be found in
[Dow11, sec. 4]. Additionally, we can argue mathematically that since $KM \approx Km$,
the simplest hypothesis is very close to the mixture. Therefore the debate is more
philosophical than practical in this setting.

An alternative approach to formalising Occam's razor has been considered in
MML [WB68]. However, in the deterministic setting the probability of the data
given the hypothesis satisfies $P(D|H) = 1$. This means the two part code reduces to
the code-length of the prior, $\log(1/P(H))$, so the hypothesis with minimum
message length depends only on the choice of prior, not the complexity of coding
the data. The question then is how to choose the prior, on which MML gives no
general guidance. Some discussion of Occam's razor from a Kolmogorov complexity
viewpoint can be found in [Hut10, KLV97, RH11], while the relation between MML
and Kolmogorov complexity is explored in [WD99].

Assumptions. We assumed finite $X$, $Y$, and deterministic $f$, which is the standard
transductive learning setting. Generalisations to countable spaces may still
be possible using complexity approaches, but non-computable real numbers prove
more difficult. One can either argue by the strong Church-Turing thesis that
non-computable reals do not exist, or approximate them arbitrarily well. Stochastic $f$
are interesting and we believe a complexity-based approach will still be effective,
although the theorems and proofs may turn out to be somewhat different.
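A Technical proofs

Lemma 16 (Entropy inequality).

$$\begin{aligned}
0 &\le a(1 - \theta) \log \frac{1}{1 - \theta} && (10) \\
&\le H(\theta) - [\theta + (1 - \theta)(1 - a)] \, H\!\left( \frac{\theta}{\theta + (1 - \theta)(1 - a)} \right) && (11)
\end{aligned}$$

With equality only if $\theta \in \{0, 1\}$ or $a = 0$.

Proof. First, (10) is trivial. To prove (11), note that for $a = 0$ or $\theta \in \{0, 1\}$, equality
is obvious. Now, fixing $\theta \in (0, 1)$ and computing,

$$\frac{\partial}{\partial a} \left[ H(\theta) - [\theta + (1 - \theta)(1 - a)] \, H\!\left( \frac{\theta}{\theta + (1 - \theta)(1 - a)} \right) \right] = (1 - \theta) \log \frac{1 - a(1 - \theta)}{(1 - a)(1 - \theta)} \ge (1 - \theta) \log (1 - \theta)^{-1}$$

Therefore integrating both sides over $a$ gives,

$$a(1 - \theta) \log (1 - \theta)^{-1} \le H(\theta) - [\theta + (1 - \theta)(1 - a)] \, H\!\left( \frac{\theta}{\theta + (1 - \theta)(1 - a)} \right)$$

as required. □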
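The inequality of Lemma 16 can also be checked numerically on a grid. This is a sanity check rather than a proof. The sketch assumes logarithms to base 2 (the inequality is the same in any base) and an arbitrary grid resolution.

```python
import math

def H(theta: float) -> float:
    """Binary entropy with H(0) = H(1) = 0."""
    if theta in (0.0, 1.0):
        return 0.0
    return -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

def J(theta: float, a: float) -> float:
    """J(theta, a) = s * H(theta / s) with s = theta + (1 - theta)(1 - a)."""
    s = theta + (1 - theta) * (1 - a)
    return s * H(theta / s)

def gap(theta: float, a: float) -> float:
    """Right-hand side of (11) minus the middle term of (10); should be >= 0."""
    lower = a * (1 - theta) * math.log2(1 / (1 - theta))
    return H(theta) - J(theta, a) - lower

thetas = [i / 50 for i in range(1, 50)]   # theta in (0, 1)
alphas = [i / 50 for i in range(0, 51)]   # a in [0, 1]
worst_gap = min(gap(t, a) for t in thetas for a in alphas)
```

On this grid `worst_gap` is zero up to rounding error and is attained at a = 0, matching the equality case stated in the lemma.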